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Abstract. Spectral element methods are high-order weighted 
residual techniques for partial differential equations that 
combine the geometric flexibility of finite element methods 
with the rapid convergence of spectral techniques. In this 
paper we describe spectral element methods for the 
simulation of incompressible fluid flows, with special 
emphasis on implementation of spectral element techniques on 
me d i um- grained parallel processors. Two parallel 
architectures are considered: the first, a commercially 

available message-passing hypercube system; the second, a 
developmental reconfigurable architecture based on Geometry- 
Defining Processors. High parallel efficiency is obtained 
in hypercube spectral element computations, indicating that 
load balancing and commumi c a t i on issues can be successfully 
addressed by a high-order technique/medium-grained processor 
a 1 gor i thm- ar chi t e c tur e coupling. 

1. Introduction. Spectral element methods are high-order 
we i gh t ed - r e s i dua 1 techniques for partial differential 
equations that exploit both the common foundations and 
competitive advantages of h-type finite element methods 
(Strang and Fix, 1973) and p-type spectral techniques 
(Gottlieb and Orszag, 1977). In the spectral element 
discretization (Patera, 1984; Maday and Patera, 1987), the 
computational domain is broken up into ma c r o - spe c t r a 1 
elements, and the dependent and independent variables are 
approximated by Nth order tensor-product polynomial 
expansions within the individual subdomains. Variational 
projection operators and Gauss numerical quadrature are used 
to generate the discrete equations, wh ich are then solved by 
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direct or iterative procedures based on t e ns o r - p r oduc t sum- 
factorization techniques. Convergence to the exact solution 
is achieved by increasing the degree, N, of the polynomial 
approximation, while keeping fixed the number and identi ty 
of the underlying spectral elements. 

The spectral element discretization is an example of 
domain decomposition in that an individual spectral element 
is the "basic unit” as regards strong coupling between 
de g r e e s - o f - f r e e dom. In the spectral element discretization 

the elements serve as the basic building blocks for mesh 
generation and physics, as well as the basic units of 
locally structured mesh. The latter is critical as regards 
the computational efficiency of the method, as it allows for 
the sum- f ac t or i zed tensor -produc t matrix-vector products 
that render high-order methods competitive with low-order 
techniques . 

.Although spectral element methods have proven 
competitive with h-type finite element methods for 
incompressible flow simulations on serial and vector 
computers (Chaddar, Magen, Mikic and Patera, 1986), it is 
clear that in the future techniques mus t be judged on the 
basis of their performance on parallel machines. 

Fortunately, whereas global spectral methods are not 
obviously pa r a 1 1 e 1 i zab 1 e due to the strong coupling of all 
points in the computational domain, spectral element methods 
have an intrinsic "domain-decomposition” granularity that in 
turn leads to natural and efficient implementation on 
med i um- g r a i ned parallel processors. 

In this paper we focus on the parallel implementation 
of spectral element algorithms. In Section 2 we describe 
the basic spectral element discretization as applied to 
elliptic problems. In Section 3 we discuss the parallel 
implementation of spectral element elliptic algorithms on a 
me s s ag e - pa s s i ng hypercube system, and present results for 
parallel efficiency on a commercially available Intel 
machine. The viability of the high-order method/medium- 
grained approach to parallel computing is discussed, and the 
issues of communication latency and 1 oad - ba 1 anc i ng are 
addressed. Lastly, in Section 4 we present a developmental 
special-purpose r e c onf i gur ab 1 e architecture, Geometry- 
Defining Processors, and discuss the advantages of parallel 
implementation of spectral element and substructured finite 
element algorithms on this novel configuration. 

2. Spectral Element Methods. 

Elliptic Equations . To illustrate the basic spectral 
element concepts we first consider the following one- 
dimensional elliptic Helmholtz problem: Find u(x) defined 
over -4«]-l,l[ such that 

- ( pu ’ ) ’ + A 2 u - f x6il 
u(-l) = uCl) = 0 


(2.1a) 

(2.1b) 
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Here prime denotes differentiation, and we assume that p(x) 

> p > 0, and A 6 R. The spectral element discretization for 
solving (2.1) is based on the variational formulation: 

Find u € Hq(A) such that 

(2,2) a ( u , v ) - ( f , v ) Vv€HjU) , 

whe r e 


a(^,V0 - / p(x)<£’ (x)V'* (x)dx + A 2 / <£(x)t/>(x)dx 
A A 


(<£,0) - / ^(x)^(x)dx . 

A 

XXV X V XXq 1 s the space of all functions which satisfy the 
homogeneous boundary conditions (2,1b), which are square 
integrable, and whose derivatives are also square 
integrable . 

The spectral element discretization proceeds by 
breaking up the interval A into K subinterval|^(spectral 
elements) ••*■'*£* where the length of the k interval is 
L. . The space of approximat ion for the solution ^ is then 
taken to be a high-order polynomial subspace of H^(A) , 

X h" H 0 ( ^ nP N t K (A } ’ wher * P N f K^ ) " U k P N^ il k ) ’ t J nd - P $r k\ lS A the 
space of all polynomials of degree _iN on the interval . 

To accurately evaluate the integrals associated with the 

bilinear forms in (2.3) we use Gaus s -Lo ba t t o Legendre 

numerical quadrature with quadrature points and weights 

given by (|.,p.), i-0,....,N. 

The abstract formulation of the resulting discrete 
problem is: Find u. € X. such that 


(2.3a) 
and , 
(2.3b) 


H» r * H 


1 


(2.4) a 


h,GL (u h 


v - 


^ f ’ V h^h ,GL 


Vv h €X h 


where subscript h,GL refers to Gaus s -Loba t t o numerical 
quadrature of the bilinear forms in (2.3). Convergence of 
the spectral element solution u^ to the exact solution u is 
then achieved by increasing the degree of the approximation, 
N, for fixed K. It is shown in Maday and Patera (1987) for 
this spectral (p-type) convergence strategy that the error 
goes to zero exponentially fast with N for sufficiently 
smooth solutions u. This should be contrasted to finite 
element h-type methods (N fixed, K => oo) for which algebraic 
convergence results. 

To proceed further, we represent our polynomial space 
with a nodal basis, that is, any function w. in is 
written in terms of elemental Lagrangian interpolants h^(r) 
through the Gaus s -Loba t t o Legendre points, 

. N 

(2.5a) w. ( r ) = E w.h.(r) x£.4, => r£vl , 

h i-0 1 1 k 

h.€P N U), h .(£.)-$.. Vi , j£{0 N} 2 , 


(2.5b) 
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where w* is the value of w h at the Q local point f . in the 
interval A . In order to ensure C -continuity and satisfy 
the boundary conditions we further require that 

(2.6a) wjjj - w 0 k+1 VkG{l K-l} , 

and 

(2.6b) w 0 ” W N * 0 


Expressing u, and v. in terms of the nodal basis (2.6), and 
choosing each test Function to be nonzero at only one global 
collocation point, we arrive at the following set of 
discrete equations: 


(2.7) 


K, 2 N 

E [ — E 

k-l 


N 

E 


L^j-0 q> 


k k 

P p*D .D . u. 

q q qi qj j 


♦ a 2 %.u.i 

2 1 1 


K, 

E 

k-l 


k r k 
— p . f . 

. i i 


where p and f 
point r^f^ in $ 
stiffness 
As a 


are the values of 
D. : -dh ,(£. ) /dr , 


p(x) ^nd f(x) at the local 
and E denotes the direct 


sunmal i on ^h i cd. incorporates (2.6). 
numerical gx ample we consider the problem 


x - . (2.1) 

with p(x)-e , f(x)-e (cosx-sinx), A-0 on x-]0,tt[, for 
the solution is u(x)--sinx. The particular spectral 
discretization corresponds to division of the domain 
K-2 similar spectral elements, each of order N. Fig. 
a plot of the maximum nodal error llu-u^l 1^. ^oo as a 
function of the total number of degrees-of -Tfeedom N 
Exponential convergence is achieved due to the fact 
solution is analytic. 

The spectral element discretization of multi- 
dimensional elliptic equations corresponds to a tensor- 
product extension of the one -d imens i ona 1 method. We 
consider here the two-dimensional Poisson equation on a 
domain 17 with homogeneous boundary conditions on the domain 
boundary dfl , 


wh i ch 
e 1 ement 
into 
1 shows 


-2N+1 . 
that the 


(2.8a) 

(2.8b) 


-An 

u 


f 

0 


in 17 
on 517 . 


In this case the domain is broken up into K quadrilateral 

elements 17^ 17^, and the corresponding discrete 

formulation is given by: Find u^ G such that 


(2.9) ( Vu h ’ Vv h ) h>GL 


^ f ’ V h^h ,GL 


Vv h GX h ’ 


where x h = H 0 (^^ P N .K^ * . He r 6 ’ and P N^k ) 

the space of ail piecewise polynomials or degree j<N with 


i s 
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respect to each spatial variable x and y. For an error 
analysis of the discretization (2.9) we refer to Funaro 
(1986) where it is shown that .spectral convergence obtains 
as N =>• oo for fixed K. 

A basis for the space is constructed by expressing 
any function w. in X. in t ens o r - pr oduc t form within each 
e 1 eme n t 17^ ( ( x , y ) € 17^ => ( r , s ) E AxA ) , 

v N N . 

(2.10) w* ( r , s ) « £ £ w . . h . (r)h.(s) 

* i-0 j-0 lJ 1 J 

where h. are the one -d imens i ona 1 Gaus s -Loba t t o Lagrangian 
interpolants defined in (2.5b). The solution u, , the 
t e s t f unc t i ons v^, and the (isoparametric) geometry C x ^ , y . ) 
are all expressed in terms of the nodal basis (2.10), ana 
the integrals in (2.9) are evaluated using Gaus s -Loba t t o 
Legendre x Gaus s -Loba t t o Legendre quadrature in (r,s). 
Choosing appropriate ’’delta” t e s t f unc t i ons v^ we arrive at 
thesetoflinearequations, 

(2.11) A JL - i X , 


where A is the discrete Laplace operator, jl is the 
(diagonal) mass matrix, and underscore denotes nodal 
unknowns . 

As a two-dimensional example we consider the problem 
(2.8a) on the domain 17- ( xE ] 0 , 1 [ , yE ] 0 , 1 + ^ s i n?rx [ ) with f-0 and 
u-sinx-e ^ . Fig. 2 shows the isoparametric spectral element 
discretization (K-4), while in Figure 3 we plot the maximum 
nodal error I lu-u^l 1^ l°° as a f unct i° n of the number of 
de g r e e s - o f - f r e edom in dne spatial direction. Note that 
although the domain is relatively deformed compared to the 
mapped rectilinear problem, exponential convergence to the 
analytic solution is obtained (Ronquist and Patera, 1987). 

Elliptic Solvers . In this section we address the problem of 
solution of the linear po s i t i ve -de f i n i t e symmetric systems 
resulting from spectral element discretization of self- 
adjoint elliptic equations. For large three-dimensional 
problems the most attractive approach is iterative solvers, 
both in terms of operation count and storage requirements. 

In order for any solution method to be efficient in this 
context, fast matrix-vector evaluation schemes are needed; 
for high-order methods this is effected by tensor-product 
sum factorization, which we now discuss. 

Consider the solution of the two-dimensional Poisson 
equation (2.8) in a rectilinear domain J7, resulting in the 
spectral element discretization (2.11). The matrix-vector 
product A JA. can be written in terms of an elemental sum 


K, 

£ 

k=l 


N 

£ 

m, n 


. k k 
A. . u 
i jmn mn 



( 2 . 12 ) 


A JL = 


Vi , jE { 0 , . . 
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Naive evaluation of (2.12) requires 0(N 4 ) operations per 
element,. and also requires 0(N ) storage as the elemental 
matrix A . . is, in general, full. Fortunately, due to the 
t ens o r - pr JWuc t representation (2.10), a typical elemental 
term in (2.12) can be f^ctojed as (consider here only the 
term corresponding to d / dx ) 


(2.13) [ 


N 

27 ] 

q-0 


qi 


p p . [ 

q j 


N 

E 

m=-0 


D u k . 
qm mj 


]]] Vi,jG{0, N} 


of 


0(N 3 ) 


and a s torage 


naive approach of 
jind then evaluating 


yielding an operation cgunt 
requirement of only 0(N ). 

For three-dimensional problems th£ 
first forming the elemental matrices A^ 

the elemental ma t r i x- ve c t o r produces A JlT results in an 
operation count and storage of 0(N ) per element, while the 
tensor^product sum factorization yields. an operation count 
of 0(N 4 ) and storage requirement of 0(N J ). It should be 
noted that for a geg^jal d - d imens i ona 1 problem, the 
operation count 0(N ) persists even for isoparametric 

discretization of non- s e pa r ab 1 e problems. 

Given the efficient t ens o r - pr oduc t evaluation, the 
system (2.11) is solved by preconditioned conjugate gradient 
iteration (Golub and Van Loan, 1983). First, we construct 
A, an h-type finite element approximation defined on the 
spectral element quadrature points (Orszag, 1980; Deville 
and Mund , 1985). The matrix A is then further approximated 
by its incomplete Cholesky decomposition, A, , which in turn 
is used as a pr e c ond i t i one r for A (Meijerini and Van der 
Vorst, 1977; Kershaw, 1978). As a test problem to 
demonstrate the preconditioned conjugate gradient algorithm, 
we consider solution of..the two-dimensional Poisson equation 
(2.8a), with f-e X , u-^e X ^ , and fi defined by 
( xE^O , 1 [ , yE ] 0 , 1 [ ) . The domain is discretized by an array 
K-K^ spectral elements, each of order N. In Figure 4 we 
plot the convergence history as a function of number of 
iterations for Kj-4 and N-5,7,9, and 11. The results 
demonstrate the significant savings 
preconditioning. 


o f 


due to effective 


Stokes Discretizations . To illustrate h ow the spectral 
element discretization of the Poisson equation extends to 
the full Stokes equations we consider here the following 
problem: Find a velocity u«(u,v) and a pressure p in the 
domain 17- ] - 1 , 1 [ x ] 0 , 2?r[ such that 


(2.14a) -uAm + Vp - f 
(2 . 14b) -divu = 0 


subject to the s emi -pe r i od i c boundary conditions 

(2.15a) Vy€]0,2Jr[, u(-l,y) = u(l,y) = 0 , 

(2.15b) Vx€]-l,l[, u( x , 0 ) - u(x,2tt) 
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Here v is the kinematic viscosity of the fluid, and f is the 
prescribed force. 

Due to periodicity the dependent variables can be 
written in terms of Fourier series in the y-direction, 


(2.16a) 

u(x,y) - H u n (x) 
n- -oo 

exp(_i_ny ) 

(2.16b) 

oo 

p(x,y) - £ p n (x) 

n« -oo 

exp(_i_ny ) 


where J_ - y/-^T • The Fourier representation (2.16) results in 
a set of decoupled equations for each Fourier mode n, the 
variational statement of which is given by (dropping the 
Fourier superscript and carat notation): For n € N* find 
u-(urv) in X-[Hq(A)]^ and p in M-Lv-4) such that 

2 

(2.17a) ^[(u j ,w x ) + n ( u ,w) ] - ( p .w^+Jjiz )-( f ,w) ,Vw-(w,z)GX 

(2.17b) ( q , u^+Jjiv ) * - 0 ,Vq£M , 

2 

where -4“]-l,l[, L is the space of square- integrable 
functions, * refers to complex conjugate, and (<£,V0 “ 

. For simplicity, we do not consider the case of n-0 . 
In the same fashion as for the elliptic problem (2.1), 
the spectral element discretization proceeds by breaking up 
the interval A into K subintervals and searching for a 
solution in polynomial subspaces of X and M. However, 
choosing the polynomial degree N to be the same for the 
velocity and the pressure leads to spurious modes in the 
pressure. It can be shown that an optimal strategy is to 
use a staggered mesh for the pressure for wh i c h the 
associated polynomial degree is N-2 (Bernardi and Maday, 
1987; Maday, Patera and Rdnquist, 1987a; Maday, Patera and 
Rdnquist, 1987b). The discrete formulation corresponding to^ 
(2.17) can then be stated as: Find u,. in X„- [H 0 (-A)nP f , „(A)] 
and p^ in ^ such that ‘ A ‘ 

(2.18a) "[( u Nx ’ w x ) N ,GL +n (u N ,w) N,GL^ ■ ( ' P N ,W x +JJ1Z - ) N,G 

- (f,w) N GL , Vw = (w,z)€X N 

(2.18b) (q,u Nx + i^v N )» )(J - 0 , VqQ^ , 

where the subscripts N,G and N,GL denote numerical 
quadrature based on the Gauss and Gaus s -Loba t t o points, 
respectively. Theoretical error estimates for the 
approximation error due to the discretization (2.18) can be 
found in (Maday and Patera, 1987; Maday, Patera and 
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Rdnquist, 1987a), in which spectral convergence is 
demonstrated as N => oo for fixed K. 

We now define the bases for the space X^xM^. In t gach 
element A the velocity is expanded in terms of N order 
Lagrangian interpolants h. through the Gaus s -Loba t t o 
Legendre points i-0, N , 

N . 

(2.19) u*(r) - E uv'h.(r) X&A. ^ iGA , 

N i-0 

while the pressure is expanded in terms of (N-2) 1 * 1 order 
Lagrangian interpolants h. through the Gauss Legendre points 
i-1 N- 1 , 1 

. N-l k~ 

(2.20) Pv[(r) - E p.h.(r) xG.4. => iGA . 

in i-1 1 1 K 

Note the Gauss points are naturally suited for the pressure, 
which need not be continuous across elemental boundaries. 

The expansions for the the velocity and the pressure are 
inserted into (2.18), appropriate ’’delta” t e s t f unc t i ons 
G X^j and q„ € are chosen, and we arrive at the 
discrete saddle problem, 

(2.21a) A JU. - D T 42. - A X 

(2.21b) -Cjl-H 

Here the matrix A is the discrete Laplacian, H is the mass 
matrix, 12 is the discrete gradient operator, and underscore 
refers to nodal unknowns. 

As a numerical example we consider the following test 
problem with i'-l.O, n-1 , 

(2.22a) (u,v) - ( - ( l + cos7rx) , J_sin7Tx) 

(2.22b) p - sinTrx 2 

(2.22c) (f,g) - (-(l + COS7Tx)/7T, _L(2 +tt )sin7rx) , 

on a domain A which is divided into K-2 similar spectral 
elements. We show in Figure 5 the error in the velocity and 
the pressure as the order N of the polynomial expansions is 
increased. As expected, exponential convergence is achieved 
due to the fact that the solution is analytic. In the case 
when the solution is less regular we obtain an algebraic 
convergence rate reflecting all the regularity of the 
solution (Maday, Patera and Rbnquist, 1987a). Note that we 
obtain spectral convergence with only limited continuity 
between elements due to proper choice of the weak form. 

Stokes Solvers . In this section we consider solution of the 
algebraic system of equations (2.21) resulting from the 
spectral element discretization of the Stokes problem 
(2.14). It should be noted that the algorithms presented 
here are equally appropriate for other types of variational 
discretizations, and are in fact extensions of the classical 
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Uzawa algorithm used in finite element analysis (Girault and 
Raviart, 1986; Bristeau, Glowinski and Periaux, 1987). 

Our approach (Maday, Meiron, Rdnquist and Patera, 1987) 
to solving (2.21) is a global iterative procedure in which 
the original coupled saddle problem is decoupled into two 
positive-definite symmetric forms, 

(2.23a) -A Jl + IiV - £ X , 

(2.23b) £ £ - J2 A'*£ X . 

whe r e 

(2.24) £ - D. A _1 J2 T • 


In the solution process (2.23b) is first solved for _p. and 
then (2.23a) is solved for jl with _p. known. 

The equation for the velocities jl correspond to 
standard Laplacian solves, and the inversion of the matrix A 
is done by conjugate gradient iteration. Although the 
matrix £ in th^ equation for the pressure is full due to the 
presence of A , the matrix is extremely we 1 1 - cond i t i oned . 

In particular, it can be sh own that the ma t r i x £ is 
spectrally close to the variational equivalent of the 
identity matrix _L, namely the mass matrix £ defined on the 
pressure mesh. This suggests an inner/outer conjugate 
gradient iteration procedure in which the outer iteration 
corresponds to inverting the matrix £ preconditioned with 
the diagonal mass matrix £, and the inner iteration 
corresponds to inverting the discrete Laplacian, A. As the 
condition number for the matrix £ £ is of order unity, the 

above algorithm requires only order unity elliptic solves, 
and is therefore an ideal decoupling of the Stokes problem. 

We now present numerical results demonstrating the good 
conditioning of the matrix £ for the semi-periodic model 
problem (2^14) and (2.15). Figure 6 shows the spectrum 
A^(n) of £ X £ for the spectral element discretization with 
K=4 , N-7 , and wave number n«l . The clustering of the 
spectrum around unity is apparent, and a comparison with the 
continuous operator spectrum is seen to be virtually exact 
(Maday, Mei-ron, Rdnquist and Patera, 1987). For multi- 
dimensional Stokes problem it can be shown that the 
conditioning k* of £’ X £ is still of order unity, allowing 
for rapid convergence of the outer iteration. It should be 
noted that the good conditioning of £ £ is directly related 

to good approximation of the pressure (Maday and Patera, 
1987) . 

The discretization and solution of the steady Stokes 
equations (2.14) can be readily extended to solve the 
unsteady Stokes equations. The implicit Stokes algorithm 
can be further extended to solve the unsteady Nav i e r - S t oke s 
equations by explicit treatment of the nonlinear term using 
a third-order Adams -Ba shf o r th scheme. The resulting Navier- 
Stokes algorithm can be viewed as an implicit Stokes scheme 
with an "augmented” force which includes the explicit 
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convective contributions. Numerous spectral element 
calculations of unsteady flows have been performed to date 
(Ghaddar, Korczak, Mikic and Patera, 1986; Ghaddar, Magen, 
Mikic and Patera, 1986). 

3. Implementation on a Hypercube. 

Tntroduct ion . It is clear that the new direction of high 
performance architectures will entail large scale 
parallelism, and that successful algorithms must be able to 
exploit these new technological advances. The spectral 
element discretization and solution algorithms described in 
the previous section have been designed with parallel 
architecture in mind; in this section, we outline some 
parallel algorithm concepts, and describe the implementation 
of the spectral element method on the Intel iPSC Hypercube 
parallel processor. 

The Intel iPSC Hypercube is typical of the general 
class of parallel architectures for wh ich our algorit hms are 
appropriate. The Intel machine is a true multiple ^ 
i ns t r uc t i on-mu 1 1 i p 1 e data parallel processor, with M-2 
processors arranged in a classical hypercube fashion (Saad 
and Schul t z , 1985 ) . Each processor has local source code 
operating on local data, with execution proceeding 
asynchronously until data is needed from another processor 
in the system. Information is exchanged between processors 
via a message passing scheme, in which data_send and 
data_receive statements are issued by the participating 
processors. Efficient use of all the processors can be 
achieved only if the ratio of communication time to 
computation time is small, and if all the processors have 
the same amount of work, that is, are balanced as regards 
their computational load. 

The key aspects of the spectral element formulation of 
the Navier-Stokes probl em wh ich all ow for successful 
parallelization are the following: 

a) use of a d oma in dec omp osition approach; 

b) limited coupling between domains; 

c) high order discretization within domains; 

d) iterative solution techniques. 

Point (a) addresses the central issue of how one divides the 
problem among the available processors. Following the 
domain decomposition approach, one or more spectral elements 
are assigned to each processor, with all the coefficients, 
geometry, and data associated with each element being 
locally resident. For our problem, the most natural 
decomposition is geometric, that is, adjacent elements 
reside on the same processor so that inter-processor 
communication can be minimized. An alternative approach is 
to use a random element placement such that incremental work 
associated with local mesh refinement is distributed among 
several processors (Fox and Otto, 1985). 
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Points (b) and (c) address the requirement that there 
be minimal conmunicat ion/computat ion if^high parallel 
efficiencies are to be attained. The C continuity 
condition at elemental i n t e r f a c e s 2 imp 1 i e ^ that the inter- 
element communication will be 0(N ) in R , while the high- 
order discretization implies that the computational work 
will be 0(N 4 ) (see Section 2). In essence, high-order 
methods imply a large amount of computation per point, thus 
resulting in a low ratio of communication to computation. 
Point (d) essentially addresses the load balance issue; 
iterative procedures such as conjugate gradient iteration 
are more readily amenable to node - s imul taneous calculation 
than direct methods such as Gaussian elimination. By the 
same argument standard conjugate gradient preconditioning 
algorithms, such as incomplete Cholesky factorization, are 
not readily pa r a 1 1 e 1 i zab 1 e ; work is underway to find 
suitable alternatives. 

Algorithm Description . At the heart of the parallel Navier- 
Stokes algorit hm are the elliptic solvers wh i c h emp 1 o y 
(unpreconditioned) conjugate gradient iteration (Golub and 
Van Loan, 1983). To illustrate the parallel implementation 
we describe the central step in this procedure, which is 
formation and evaluation of matrix vector products of the 
type : 

(3.1) _r = Ap 

where_r is the desired global result vector, jl is an 
intermediate search direction, and A is the global Lapiacian 
operator defined in Section 2. By definition of the 
variational statement (2. 2-2. 3), (3.1) can be written as: 

(3.2) _r - £A k _p k 

where A is the local Laplace operator^coupl ing the nodal 
unknowns within the kth element, and jj. represents the 
values of _p. in element k. 

To compute x in parallel, we first calculate 

(3.3) I k - A k X k k=l,2 K , 

and then compute 

(3.4) x - £ X k . 


Here direct stiffness corresponds to summation of all 
elemental nodal values corresponding to the same global 
degree -of -f reedom. Note that in practice (3.4) is not a 
global operation, but corresponds to local exchange of face 
data between neighboring spectral elements. The sequence of 
steps (3. 3-3. 4) is depicted graphically in Figure 7. 

The only other computational step in the conjugate 
gradient algorithm requiring inter-processor communication 
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is the evaluation of inner products of the intermediate 
result vectors. This is readily handled using a 
substructuring technique, whereby local contributions to the 
inner product are computed on individual processors and then 
summed via a b inary- 1 ree - 1 ike gather and sum. The final 
result is then redistributed to all the processors via the 
same binary tree, resulting in a total communication count 
of 2D, where D is the dimension of the hypercube network 
(Saad and Schultz, 1985). The communication associated with 
the inner products is therefore small compared to that 
associated with direct stiffness summation. 

Since the direct stiffness summation (3.4) involves 
only the faces of the spectral elements it is an 0(N ) 
operation (per element)^ in R , whereas the matrii-vector 
product (3.3) is an 0(N ) operation. The rat^o of 
calculation to communication is therefore 0(N ), which will 
typi.cally be quite large and will thus result in non- 
communication bound simulations. ^Note also that the direct 
stiffness sum message length of N z will be relatively large, 
and thus startup costs associated with each data_send will 
be amortized over a large number of words, particularly on 
pipelined configurations. Thus, in theory, the spectral 
e 1 ement / c on j uga t e gradient algorithm suffers little 
communication overhead when implemented on a distributed 
memory parallel processor. 

As an example of the communication overhead incurred in 
practice, we plot in Figure 8 parallel efficiency tj vs. D 
for solution to a Poisson equation in R . Parallel 
efficiency is defined here as: 

T 1 

(3.5) V , 

mt m 

where M=2 D is the number of processors, and T is the time 
required to solve a given problem on m processors. The 
efficiencies range from 0.96 to 0.99 for D ranging from 0 to 
5. The K=32 spectral element discretization used in this 
example is of moderate order, N=6; higher order solutions 
will show a further increase in parallel efficiency. 

The actual efficiency obtained will of course be a 
strong function of the particular hardware used. The 
results of Figure 8 were obtained on a D=5 iPSC Hypercube 
without high performance vector processors. Increases in 
floating point performance will increase the absolute 
calculation speed, but will at the same time make the 
parallel efficiency worse. The question remains as to 
whether communications capability will continue to keep pace 
with the rapid advances in floating point hardware. 

Virtual Parallel Processor . Central to the development of 
the parallel spectral element algorithm is the treatment of 
each element as a separate entity, that is, as a virtual 
processor. All data structures are set up on an element by 
element basis, and all operations are executed element by 


13 


element on this e 1 emen t - 1 o c a 1 data. In this way, 
parallelism is inherent in the algorithm, and not overly 
dependent upon the particular machine on which the code is 
run . 

The practical implementation of this virtual parallel 
processor concept is relatively simple. The critical 
feature is that all outer loops in the code run over the 
number of elements; on a serial machine, loops extend over 
the entire domain, while on a parallel machine, loops extend 
over the subset of elements associated with a given 
processor. The source code on each of the processors is 
therefore identical. The spectral e 1 ement - v i r tua 1 processor 
association further insures that the few operations 
requiring communication between processors (direct stiffness 
summation, inner product evaluation) can be easily effected 
by device drivers which translate send and receive 
subroutines into ha r dwa r e - c ompa t i b 1 e data_send and data_ 
receive commands. 

Alternative Algorithm/Architecture Coupling s . We now 
consider our particular choice of architecture and algorithm 
in light of other available alternatives. Our algorithm is 
well suited for what we define as me d i um- g r a i ned parallel 
processing. Defining N to be the total number of degrees- 
of-freedom, M to be the number of processors, and p - N /M, 
we classify systems as 

coarse-grained: N => oo, M fixed, p => 0 ; 

medium-gr*a ined : ^oo, M =>oo, p<< 1, (p =4 0) ; 

fine-grained: N =»• oo, M => oo, p => 1 

Coarse-grained machines do not exploit the full parallel 
potential of our algorithm and are consequently not of 
interest. It is also fairly clear, a priori , that high- 
order methods will not readily parallelize on fine-grained 
systems due to communication considerations. We therefore 
restrict our comparison to h i gh- o r de r /me d i um- g r a i ned 
couplings versus 1 ow- o r de r / f i ne - g r a i ne d couplings. 

We first note that a low-order approach is necessarily 
going to require many more points to achieve the same level 
of accuracy as the spectral element method. Hence, the 
fine-grained system will need many more processors than the 
number of de g r e e s - o f - f r e edom in the spectral element 
solution. Second, for each of the fine-grained processors 
the ratio of communication time to computation time will be 
relatively large, as the number of words transmitted during 
direct stiffness summation will be of the same order as the 
number of calculations required to evaluate the local 
residual. Third, the messages will of necessity be short on 
a fine-grained processor, and the advantages of pipelined 
communication may not be realized. 

Next, we consider the issue of mesh refinement. 

Adaptive resolution is readily implemented in the spectral 
e 1 erne n t /med i um- g r a i ne d algorithm. This is effected by 
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increasing the order of the polynomial approximation in the 
vicinity of the desired high resolution area (Mavr i p i 1 i s ) ; 
due to spectral accuracy, an order of magnitude improvement 
can be obtained at the cost of only a few additional points 
in each spatial direction. Because of this exponential gain 
in accuracy, and because the work per processor is already 
large, the incremental work incurred due to refinement will 
be a fraction of the total work on the processor. 

Therefore, the processor load imbalance resulting from this 
localized refinement will not be significant. In contrast, 
mesh refinement on a f i ne - g r a i ned/ 1 ow- o r de r implementation 
will probably require at least a doubling of the number of 
points on the relevant processors to improve local 
resolution. The incremental work increase will then be of 
the same order as the initial work, and the parallel 
efficiency for an initially well balanced problem will be 
reduced by a significant factor unless a large-scale load 
redistribution is carried out. 

Ultimately, the choice of parallel architecture and 
associated algorithm must be coupled to economic 
considerations which are difficult to predict. The above 
discussion should only be considered as plausibility 
arguments for a high-order me t hod/med i um- g r a i ne d coupling. 

4. Geometry Defining Processors (GDPs). 

Mo t i va t i on . The use of the hypercube for solution of finite 
or spectral element equations as described in Section 3 
results in efficient parallel solution of partial 
differential equations (PDE) in three dimensional 
geometries. However, when applied to this class of 
problems, the general-purpose hypercube communications 
network is too flexible, and, at the same time, non-optimal. 
In particular, as mo re processors are applied to the 
solution of a problem, the hypercube’s log(M) connections 
will grow, and eventually exceed, the number required for 
nearest neighbor communication; these unnecessary 
connections represent a significant increase in per 
processor cost and complexity. Furthermore, even with the 
available log(M) connections, computational problems in 
complex geometries will require some inter-processor 
communication to be routed through intermediate nodes, 
potentially resulting in increased communication latency. 

The issue of inter-processor communication raises a 
parallel processing issue that has not yet been explicitly 
considered in this paper. This is the mapping problem 
(Bokhari, 1979), which addresses how spectral (or finite) 
elements should be distributed onto a particular processor 
configuration. This mapping is typically performed by a 
(serial) front-end, wh ich is also responsible for the 
related activity of mesh generation. A number of computer- 
aided design systems have been designed to facilitate the 
transfer of a problem geometry from the physical domain to a 
computer representation, however the input and display of 
three dimensional objects and domains via essentially two 
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dimensional tools is often cumbersome, and contributes 
little to solution of the associated parallel processing 
mapping problem. 

Although the problems of efficient parallel solution 
and convenient mesh generation appear superficially 
unrelated, it can be seen that they are, in fact, intimately 
connected. In essense, mesh generation is related to the 
problem of specifying domain and connection topology, which 
is, in turn, a central issue in the subsequent parallel 
solution of the governing PDEs . In this section we briefly 
describe a special purpose parallel architecture that 
s imu 1 a t e ous ly addresses the mesh generation and parallel 
solution problems by exploiting their underlying 
similarities (Dewey and Patera, 1987). 

Geometry-Defining Processors . Ge ome t r y -De f i n i ng Processors 
(GDPs) are microprocessors housed in physical geometric 
packages which can be easily manually assembled or 
reconfigured to define any particular system geometry. An 
individual GDP is aware of the parameters of its physical 
package, is able to communicate with neighboring GDPs as 
well as with a host processor, and is capable of performing 
independent numerical computations. To solve a PDE , that 
is, to simulate a physical system, GDPs are assembled in a 
"scale" model of the actual geometry, as shown in Figure 9. 
The GDP assembly provides a means for geometry input, and 
subsequently functions as an opt ima 1 ly - c onne c t ed dedicated 
parallel processor for the particular problem at hand. 

The basic pr o c e s s o r / commun i c a t i on structure of a single 
GDP is shown schematically in Figure 10 for the simple case 
of a rectangular two-dimensional GDP. A communication port 
on each face of the GDP all ows for two types of 
communication within the GDP assembly. First, each GDP is 
able to communicate with neighboring GDPs through local 
busses. The local bus is reconf igurable , in that the lines 
emanating from different faces can be internally connected 
to route messages through the GDP; this allows high-speed, 
parallel, non-neares t -ne i ghbor communication (e.g., for 
vector reduction operations (Lin and Sips, 1986)). The 
second type of communication supported by the GDP faces is a 
global bus which all GDPs and the host computer can access. 
This bus is primarily used for communication between the 
host computer and one or more of the GDPs. 

Each GDP- resident microprocessor, in addition to 
controlling communications, performs local calculations 
related to mesh generation or equation solution using 
programs and parameters downloaded from the host computer 
via the global bus. The GDPs in an assembly are 
synchronized by the global bus, however they otherwise 
operate independently, performing autonomous calculations on 
1 o c a 1 ly - ava i 1 ab 1 e data. 

With the above description of an individual GDP, the 
following sections present details of the operation of GDP 
assemblies when applied to mesh generation and parallel 
solution. 
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Mesh Generation with GDPs . GDP mesh generation proceeds 
through several steps of communication and computation. 

After (manual) assembly of the desired geometry, the host 
interrogates the assembly of GDPs to determine the number, 
names, and geometry types of the constituent GDPs. Next, 
each GDP is instructed to determine and report any 
connections between itself and its neighbors. By virtue of 
their physical proximity, and hence communication proximity 
through the local busses, the information needed to 
determine these connections is locally available to each 
GDP. Finally, by ’’integrating” the geome t ry- type and 
connection information obtained from the GDP assembly, the 
ho s t - r e s i den t software can determine the absolute geometry 
of the assembly. 

With a philosophy similar to that of the graphics 
tablet, the GDPs allow for data input directly in physical 
space, with all symbolic, c ompu t a t i ona 1 - s pa c e processing 
performed transparently to the user. Both the minimal 
learning curve associated with initial use of GDPs, and the 
minimal time required for subsequent particular realizations 
of GDP assemblies, should make the devices a useful addition 
to conventional software geome t ry- input techniques. 

Parallel Solution with GDPs . Ge ome t r y -De f i n i ng Processor 
solution of partial differential equations proceeds as 
follows. The physical domain is first (manually) 
constructed as shown in Figure 9, and the system geometry is 
then determined by the host as described in the previous 
subsection. It should be noted that the geometry as input 
by the GDPs need only represent a ’’rough-cut” of the actual 
system geometry, as conventional computer-aided design tools 
can be used to subsequently tailor the GDP shapes in 
s o f twa r e . 

Next, each GDP is assigned a single, or mo re generally, 
a substructure of finite or spectral elements. Note that 
the mapping problem has been completely eliminated: with no 
computational overhead the pr oc e s s o r / e 1 emen t s are now 
optimally connected in that neighboring elements in problem 
space are represented by communicating processors in 
computational space. In addition to this nearest-neighbor 
communication many solution algorithms, such as conjugate 
gradient iteration, require global summation and broadcast; 
these vector reduction operations can be efficiently 
provided through the use of the r e conf i gur ab 1 e local bus. 
Thus, the GDP assembly next determines a ( t ime -dependent ) 
re-routing of the local busses that will allow these 
operations to be performed in log(M) time (Dewey and Patera, 
1987). Finally, the discrete equations, physical 
properties, and boundary conditions for each GDP are 
d own loaded fr om the host processor. 

With these mesh generation and equation definition 
preliminaries performed, the actual parallel solution is 
then initiated and subsequently synchronized by the host 
through the global bus. The operations carried out by the 
GDP s in this solution stage are essentially identical to 
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those now implemented on the hypercube processor as 
described in Section 3, with the advantage of an optimal 
connection topology that results in reduced communication 
latency and greatly reduced hardware and software 
c omp 1 e x i t y . 

Development of Prototype GDPs . We have developed prototype 
hardware GDPs and associated host software to perform simple 
three-dimensional geometry input. The GDP shapes chosen for 
these prototypes are the cubical and wedge elements shown in 
the example of Figure 9. These prototype GDPs have faces 
outfitted with optical communication ports for the local and 
global busses; a typical face is diagrammed in Figure 11. 
Optical emitters and detectors (rather than 

phy s i ca 1 /e 1 e c t r i c a 1 connections) are used in order to have 
rotationally independent, easily r e c onf i gur ab 1 e interfaces. 

A central port on each face is used for global 
communication, while the four ports along the periphery of 
each face allow for local communication and the 
determination of the rotational orientation of communicating 
faces. A 16 bit microprocessor with nominal amounts of RAM 
and EPROM in each GDP controls the communications ports and 
runs the hos t -downloaded mesh generation program. 

In operation, these prototypes allow simple three- 
dimensional geometries to be conveniently input and 
reconfigured. With these prototypes as a starting point, 
continued research aims to develop sufficient hardware and 
software to allow for a meaningful evaluation of the GDP 
concept as a me sh- g ene r a t i on and parallel solution 
architecture . 

Conclusion. In this paper, we have shown how high-order 
spectral element algorithms coupled to me d i um- g r a i ne d 
me s s age - pa s s i ng or ge ome t ry-de f i n i ng architectures can 
result in significant improvements in performance over 
conventional serial solution methods. The results presented 
demonstrate that the use of parallel processors for the 
solution of complex problems is no longer only a promise, 
but in fact a reality. Future gains in computational power 
will be the result of continued emphasis on parallelism and 
a 1 go r i thm- a r chi t e c t ur e coupling. 
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FIGURE CAPTIONS 

Figure 1. A plot of the maximum nodal error in 
the spectral element solution to the differntial 
equation (2.1) as a function of the total number 
of de g r e e s - o f - f r e edom, N . In this problem A-0 , 
p(x)-e X , f(x)-e X (cosx-sinx), for which the 
solution is u--s inx on x E ]0,7r[. The domain is 
divided into K-2 spectral elements of equal 
length. Exponential convergence is achieved as 
the polynomial degree of the fixed elements is 
increased. 

Figure 2. The domain and spatial discretization 
for solution of the Poisson equation (2.8a) with 
f-0. Dirichlet boundary conditions are imposed 
such that the solution is u-sinx*e The domain 
is divided into K=*4 spectral elements. 

Figure 3. A plot of the maximum nodal error in 
the Legendre spectral element solution of the 
Poisson equation (2.8a) as a function of the 
total number of de g r e e s - o f - f r e e dom in one 
spatial direction, N . Exponential convergence 
is achieved as the degree of the the elements is 
increased. 

Figure 4. Convergence history for preconditioned 
(closed symbols) and unpreconditioned (open 
s ymb ols) conjugate gradient solution of a 
spectral element Poisson discretization. In all 
cases K^-4, with N-5(^) , 7(0), 9(D), and ll(V). 

Figure 5. The error in the velocity u^=(u N ,v N ) 
and the pressure p^ as a function of the total 
number of degree-ot-freedom (Gauss-Lobatto 
Legendre points) in the x-direction, N , when 
solving the test problem (2.22). The total 
interval A=]- l,l[ is divided into K-2 spectral 
elements A —]-l 9 0[ and A^-]0 y \[. Exponential 
convergence is obtained. 

S 

Figure 6. A plot of th£ jpectrum A (n) of the 
preconditioned matrix £. .S., where Ji. the 

pressure matrix given in (2.24) and Jl is the 
mass matrix defined on the Gauss pressure mesh. 
The spectrum (/\) corresponds to a spectral 
element discretization K-4 , N-7 for a wavenumber 
n— 1 ; the agreement with the continuous operator 
s p e c t r um (O) is very good. 

Figure 7. Caption on figure. 
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Figure 8. Parallel efficiency vs. hypercube 
dimension for solution to three-dimensional 
Poisson equation using sixteen spectral 
e 1 ement s . 

Figure 9. Geometry-Defining Processor assembly 
corresponding to a physical domain D, which in 
this case is a block cooled by a heat-sinking 
fin. The GDP subsystem consists of the assembly 
of GDPs, a GDP interface block, and a controller 
through which the host communicates with the 
GDPs . 

Figure 10. Schematic of a (rectangular, two- 
dimensional) Geometry-Defining Processor. A GDP 
consists of a package, communication ports on 
each face, a microprocessor, a floating point 
unit (FPU), RAM and EPROM, a communications 
controller, a global bus (GB) , and a 
r e c onf i gur ab 1 e local bus (LB). 

Figure 11. Layout of a typical face of the 
prototype GDPs, showing the arrangement of 
emitters and detectors used for local and global 
communication. The use of four local detectors 
allows rotational orientation to be determined. 
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Figure 7. One dimensional example of parallel matrix-vector 
multiply, r-Ap . for three spectral elements. (a) Global and 
local representations of intermediate search direction _p.> 
where * indicates noda^ values at element interfaces which 
must be equal by the C continuity requirement. (b) Global 
form of the discrete Laplace operator, A, where + indicates 
s umma tion of overlapping contributions fr om local ma trices. 

Expansion of global product _^nto three local products, 
r , evaluated in parallel. The r are then direct stiffness 
summed to form the final result vector, x . Final ratio of 
computation to communication is O(fs) in all space 
d imens ions. 
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